A Modern Applied Statistics Sampler

Susan VanderPlas

Statistical Analyst


Nebraska Public Power District

Introduction

Background Information

  • PhD from Iowa State in 2015
    • Dissertation: The Perception of Statistical Graphics
    • RA with USDA - Soybean Genetics
  • Postdoc: Office of the Vice President for Research
    Faculty startup packages and career success
  • Statistical Analyst at Nebraska Public Power District
    • Started an internal data science training program
    • Analytics projects
    • Worked with IT to set up data science infrastructure
  • Consultant - Made Shiny applications for Iowa Soybean Assoc. and an Agronomy lab at ISU.

Outline

  • NPPD Projects
    • Tornado Guided Missiles
    • Employee Turnover Prediction
    • Assessing Compensation
  • USDA Soybean Genetics

Tornado Guided Missiles

Background

  • Preventing “beyond design basis” accidents
  • FLEX mods: standardized emergency equipment to power cooling systems
  • Each plant needs to have two sets of FLEX equipment on site

Tornado Guided Missiles

  • Reactor containment can withstand a direct hit from an airplane, tornado, or hurricane
  • Tornado resistant FLEX equipment
    • NRC must be convinced that placement is safe
    • Building distance (tornado width)
    • Angle between buildings relative to tornado path

FLEX Placement

Tornado Characteristics in Nebraska

  • Goal: Quantify characteristics of tornadoes within a certain distance of the plant
    • help the engineers defend equipment placement to the NRC
  • NRC used a National Weather Service tornado database to create its regulatory guidelines
    • covers 1950-2003
    • “probably available in PDF or microfilm form somewhere”
  • Need a report in the next 4 hours

Tornado Characteristics

NOAA Tornado dataset:

  • all recorded tornadoes between 1950 and 2015
  • 60114 separate tornado segments
  • path width
  • start/end coordinates
  • intensity

Assumption: Tornadoes travel in straight lines
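
Under the straight-line assumption, each segment's direction of travel and its distance from the plant follow from the start/end coordinates alone. The sketch below is one way to compute them, not the code used for the report. The field names (`start_lat`, `start_lon`, `end_lat`, `end_lon`) and the site coordinates are made up for illustration, and distance is simplified to the nearer of the two endpoints.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Compass bearing (0 = north, 90 = east) from the start point to the end point."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return math.degrees(math.atan2(y, x)) % 360.0


def summarize_segment(seg, site_lat, site_lon):
    """Direction of travel and nearest-endpoint distance for one tornado segment."""
    bearing = initial_bearing_deg(
        seg["start_lat"], seg["start_lon"], seg["end_lat"], seg["end_lon"]
    )
    # Simplification: use the nearer endpoint, not the nearest point on the path.
    distance = min(
        haversine_km(site_lat, site_lon, seg["start_lat"], seg["start_lon"]),
        haversine_km(site_lat, site_lon, seg["end_lat"], seg["end_lon"]),
    )
    return {"bearing_deg": bearing, "distance_km": distance}


# Hypothetical segment and site; these are not the plant's real coordinates.
segment = {"start_lat": 40.0, "start_lon": -96.0, "end_lat": 40.1, "end_lon": -95.9}
summary = summarize_segment(segment, site_lat=40.0, site_lon=-96.0)
```

Bearings computed this way can be binned by compass direction, which is what makes a polar (compass-style) chart a natural display.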

Lessons Learned

  • Visualizations are often more effective than tables, raw statistics, or other more complicated methods
    • Generally faster
    • Less intimidating
  • In some contexts, polar charts may not be 100% awful
    • Utilize familiar contexts (map compass) to make data more relatable
    • Remove extra details (e.g. y-axis) to draw attention to the important part of the graph
  • If the regulator isn’t convinced, wait for a big windstorm to come along to prove your point.

HR Turnover Prediction

Who’s going to leave?

  • NPPD has a very low turnover rate (usually < 8%/year)
  • Most employees retire after 20+ years
  • Training costs are high



Goal: Predict which individuals are likely to leave/retire

Why do people leave?

  • Career advancement
  • Dislike Nebraska/rural life
  • Better opportunities elsewhere
  • Problems with management/coworkers
  • Retirement
  • Two-body problem
  • Family reasons

Why do people leave?

  • Career advancement
  • Dislike Nebraska/rural life (maybe)
  • Better opportunities elsewhere (maybe)
  • Problems with management/coworkers (maybe)
  • Retirement
  • Two-body problem
  • Family reasons

Available Data

  • Salary information, Years of service
  • Race, Gender, Age
  • Number of dependents
  • Education level
  • Work location
  • Job description
    (security, engineering, operations, maintenance)
  • Birthplace

Model Errors

  • Identify individuals with a high probability of leaving who have not yet left
  • Perfect prediction isn’t reasonable -
    too many factors that don’t have corresponding data
  • Failure to predict someone leaving is the only good metric to go on
  • How to prove that someone with a 90% prob of leaving won’t quit tomorrow?

Method

3 Random Forest Models

  1. Resignation probability (ignore retirement)
  2. Probability of leaving (resignation or retirement)
  3. Predict resignation, retirement, or stay in a single model

If any model predicts someone has \(p > 0.5\) of leaving, examine more closely

  • HR can intervene and possibly resolve any issues
  • Succession planning - prepare to hire/train replacements for key positions
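
The "any model says \(p > 0.5\)" rule can be sketched as below. The random forests themselves are not shown; the function only combines probabilities that the three models have already produced. Treating the single-model (third) forest's probability of leaving as P(resign) + P(retire) is an assumption, as are all of the function and key names.

```python
def leave_probabilities(p_resign, p_leave, p_multi):
    """Collect each model's estimate of P(leaving) for one employee.

    p_resign: model 1, probability of resignation
    p_leave:  model 2, probability of resignation or retirement
    p_multi:  model 3, dict with keys "resign", "retire", "stay"
    """
    return {
        "resignation": p_resign,
        "leaving": p_leave,
        # Assumption: for the three-class model, leaving = resign or retire.
        "multiclass": p_multi["resign"] + p_multi["retire"],
    }


def flag_for_review(probs, threshold=0.5):
    """Names of the models whose P(leaving) exceeds the threshold."""
    return sorted(name for name, p in probs.items() if p > threshold)


# Made-up probabilities for one employee who is still with the company.
probs = leave_probabilities(0.2, 0.55, {"resign": 0.1, "retire": 0.3, "stay": 0.6})
flagged_by = flag_for_review(probs)  # non-empty list means "examine more closely"
```

Returning the model names rather than a bare yes/no keeps track of why someone was flagged (resignation risk versus retirement), which matters for choosing between HR intervention and succession planning.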

Outcome

  • Uncanny predictions
  • Currently missing some highly trained individuals who leave for better opportunities
  • Approximately 67% of the people who left over the past 6 months had been predicted to leave by at least one model
  • Early retirement program influence
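
The 67% figure is the share of actual leavers who had been flagged, which is usually called recall or sensitivity. A minimal sketch of that calculation, with made-up employee IDs:

```python
def share_of_leavers_flagged(left, flagged):
    """Fraction of employees who left that at least one model had flagged."""
    left_ids = set(left)
    if not left_ids:
        raise ValueError("no departures in this period")
    return len(left_ids & set(flagged)) / len(left_ids)


# Hypothetical IDs: three people left, two of them had been flagged.
rate = share_of_leavers_flagged(left={"a", "b", "c"}, flagged={"a", "b", "x"})
```

This number says nothing about flagged employees who stayed ("x" above); those are the cases where the model may simply be early rather than wrong.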

Lessons

  • Databases never have all of the information you want for problems like this
    • Some of it would probably be illegal to use
  • Spelling is the bane of my existence
    • 26 ways to spell “Columbus, NE”
  • Not all model errors are bad
    • We’re depending on errors to identify individuals likely to leave
    • Changing interpretation of model errors does make model validation interesting
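
For the spelling problem, one possible cleanup step is fuzzy matching against a list of known places, sketched here with Python's standard `difflib`. The canonical list, the cutoff, and the function names are assumptions for illustration; this is not how the NPPD data was actually cleaned.

```python
import difflib
import re

# Hypothetical list of canonical place names.
CANONICAL = ["Columbus, NE", "Norfolk, NE", "Kearney, NE"]


def _key(name):
    """Lowercase and drop everything except letters before comparing."""
    return re.sub(r"[^a-z]", "", name.lower())


def normalize_place(raw, canonical=CANONICAL, cutoff=0.8):
    """Return the closest canonical spelling, or None if nothing is close enough."""
    keys = {_key(c): c for c in canonical}
    match = difflib.get_close_matches(_key(raw), list(keys), n=1, cutoff=cutoff)
    return keys[match[0]] if match else None


cleaned = [normalize_place(s) for s in ["columbus ne", "Colombus, Ne.", "COLUMBUS,NE"]]
```

Values that return `None` are left for manual review rather than guessed at, since a wrong match silently moves someone to a different town.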

Lessons

  • Cost/benefit:
    • Surprise resignation: very bad
    • Talking to someone who has no intention of leaving: slightly awkward?

Assessing Compensation

Lilly Ledbetter Act & Equal Pay Act

  • Require employers to make a good faith effort to detect & address discrepancies in pay based on race or gender
  • “Similarly Situated Employee Groups” (SSEG) - ensure like jobs are compared
  • Goal: Regularly audit compensation records and address any discrepancies that are discovered
  • Significant results -> manual review

NPPD’s workforce

> 80% male, > 97% white

Most potential SSEGs have one or two minority members (at best)

Approach

  • Compare 2016 total compensation
  • Methods:
    • Overall regression (all SSEGs) with gender*SSEG interaction + other covariates
    • Individual regressions for SSEGs flagged by overall regression, using significant covariates
    • Randomization tests for all SSEGs by gender and ethnicity
      • no covariates
      • can detect effects with small #s of women or minorities
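
A randomization test of this kind can be sketched in a few lines: shuffle the group labels within one SSEG many times and see how often the shuffled difference in mean compensation is at least as large as the observed one. The numbers below are invented, and the function name, the number of shuffles, and the add-one p-value correction are choices made for this sketch.

```python
import random


def randomization_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided randomization test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return sum(a) / len(a) - sum(b) / len(b)

    pooled = list(group_a) + list(group_b)
    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(mean_diff(pooled)) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Made-up compensation values (in $1000s) for one SSEG.
observed, p_value = randomization_test([90, 91, 92, 93, 94, 95], [50, 51, 52, 53, 54, 55])
```

Because the reference distribution comes from the data itself, the test still works when one group has only a handful of members, although with very small groups the smallest attainable p-value is limited by the number of distinct relabelings.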

Results

<Censored>

Issues

  • Compensation changes + position changes
    • A previous position’s compensation ends up being compared within the new position’s SSEG
  • Not everyone is salaried; some individuals may request more overtime
    • Potential gender bias/self-selection
  • Linear Regression - non-estimable coefficients
    • Not using the regression for prediction
    • Follow up overall regression with individual regressions that aren’t rank-deficient
  • Multiple testing
    • Goal is to identify any potential problems
    • More work for HR
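
To see why uncorrected multiple testing means more work for HR, suppose \(m\) tests are each run at level \(\alpha\), there are no real discrepancies, and the tests are independent (an idealization; tests across SSEGs share data and covariates). Then

\[
E[\text{false flags}] = m\alpha,
\qquad
P(\text{at least one false flag}) = 1 - (1 - \alpha)^m .
\]

With a hypothetical \(m = 40\) and \(\alpha = 0.05\), that is 2 expected false flags and a probability of about \(1 - 0.95^{40} \approx 0.87\) of at least one. Since a flag only triggers a manual review, this cost may be acceptable when the goal is to miss no potential problem.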

Interactive Visualization of Soybean Population Genetic Data

Big Data Problems

  • “Needle in a Haystack”
    Finding one interesting thing in 100+ GB of data

  • “Needle in a stack of needles”
    100 interesting things - how to investigate them all?

Big Data paralysis

Big Data

Visualization is an important tool for working with big data

Adaptations must be made:

  • Overplotting (large \(n\))
  • High-dimensional data (large \(p\))
  • Distributed/multi-source data, hierarchical data
  • No solution (binning, dimension reduction, tours) works for every situation
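
Binning is the simplest of these adaptations to sketch: replace individual points with counts per grid cell, so the plot draws one tile per occupied cell instead of one glyph per observation. The function name and the cell widths below are illustrative.

```python
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count (x, y) points per rectangular cell; keys are integer cell indices."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


# Five made-up points collapse into three cells.
cells = bin_2d([(0.1, 0.2), (0.4, 0.9), (1.5, 0.5), (1.7, 0.1), (-0.2, 0.3)], 1.0, 1.0)
```

The size of the result depends on the number of occupied cells rather than on \(n\), which is also what keeps a web-based graphic responsive.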

Interactive Graphics

  • Provide additional information in response to user action

  • Simultaneously show more than 2-3 variables and their relationship (multiple linked plots)

  • Accommodate complex data structures

BUT…


Web-based interactive graphics may be even more size-sensitive than static graphics.

Soybean Project: People and Institutions

Overall Project Goals:

  • Understand historical yield increases
    Yields increased 100% in the past 100 years; an additional 70% increase is needed by 2050 to meet food needs (World Bank)
  • Associate genetic features with phenotypic traits
    Disease resistance, yield, nutritional content, time to maturity

  • Communicate analysis results intuitively:
    • Target: Soybean farmers, plant geneticists
    • Provide full results (tables) and graphical summaries
    • Interface with existing databases and web resources

Data


  • Sequencing Data
    (79 varieties, 75GB processed and compressed)

  • Field Trials
    (168 varieties, 30 varieties with genetic data)

  • New crosses with highest yield varieties
    (sequencing + field trials)

  • Genealogy as reported in the breeding literature
    (1600 varieties)

Visualizing SNPs

  • SNP: Single Nucleotide Polymorphism
    a single basepair mutation (A -> T, G -> A, C -> G)
  • Shiny applet: Responsive applet for user-directed data subsets
  • Show multiple levels of detail (less detail = lower computational load)
  • Provide resources in the applet for user exploration (not just a reference tool)
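
One way to get "less detail = lower computational load" is to precompute SNP counts in fixed-width windows along each chromosome, using wide windows for the overview and narrow ones once the user zooms in. This is a Python sketch of that idea, not the applet's code (the applet is written in Shiny); the chromosome labels, positions, and window sizes are made up.

```python
from collections import defaultdict


def snp_density(snps, window_bp):
    """Count SNPs per fixed-width window on each chromosome.

    snps: iterable of (chromosome, position_in_bp) pairs
    Returns {chromosome: {window_start_bp: count}} with windows in order.
    """
    density = defaultdict(dict)
    for chrom, pos in snps:
        start = (pos // window_bp) * window_bp
        density[chrom][start] = density[chrom].get(start, 0) + 1
    return {chrom: dict(sorted(windows.items())) for chrom, windows in density.items()}


# Hypothetical SNP positions.
snps = [("Chr01", 120), ("Chr01", 980), ("Chr01", 1500), ("Chr02", 10)]
overview = snp_density(snps, window_bp=1000)  # coarse: chromosome-level view
zoomed = snp_density(snps, window_bp=100)     # fine: after the user zooms in
```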

Visualizing SNPs:

  • Huge number of interesting genes (70 million ID’d SNPs)
  • 79 varieties, 20 chromosomes
  • Phenotype and genealogy information
  • Researchers tend to work on gene subsets:
    Must be able to zoom and filter
  • Optimized files for SNP results are still large (10 GB) and require significant computational resources

Above all, need an interface to allow people to pull new discoveries from the data systematically.

Applet Design

SNP Population Distribution

SNP Density

Density of SNPs: Chromosome Level

SNP Applet Overview

Individual SNPs: Comparing Varieties

Variety-Level SNP Browser

Genealogy and Phenotypes

SNP Linked Plots